A universal null-distribution for topological data analysis

One of the most elusive challenges within the area of topological data analysis is understanding the distribution of persistence diagrams arising from data. Despite much effort and its many successful applications, this is largely an open problem. We present a surprising discovery: normalized properly, persistence diagrams arising from random point-clouds obey a universal probability law. Our statements are based on extensive experimentation on both simulated and real data, covering point-clouds with vastly different geometry, topology, and probability distributions. Our results also include an explicit well-known distribution as a candidate for the universal law. We demonstrate the power of these new discoveries by proposing a new hypothesis testing framework for computing significance values for individual topological features within persistence diagrams, providing a new quantitative way to assess the significance of structure in data.


Detailed Previous Work
We first begin with a more complete review of relevant applications and related work. Persistent homology has been applied successfully in numerous applications developed over the last decade, in areas such as neuroscience [5,27,29,44], medicine and biology [15,21,37,38], material sciences [31,34,36,40,43], dynamical systems [35,52,55], and cosmology [3,45,54]. A fundamental challenge in these applications is to determine quantitatively how significant topological features are. In the above work, this is often done qualitatively or with respect to an application-specific hypothesis/model rather than a general methodology.
From a theoretical point of view, the study closest in spirit to our current work is [32], where persistence diagrams were considered as random measures. The main result shows that for a wide class of stationary processes there exists a non-random limiting measure. Later [25] proved that these measures have density, and [48] extended the limit to marked point processes. In [42] the limit of persistence diagrams as compact sets was studied, in an extreme value setting. Other related works consider various marginals of persistence diagrams (e.g., Betti numbers, Euler characteristic) and study their limiting behavior in various settings [7,8,10,11,33,41,56,57]. While these results have established a rich mathematical theory for persistence diagrams, translating them into statistical tools has lagged behind. The two main reasons are: (a) these results show that various limits exist, but in most cases without any explicit description, (b) these limits are highly dependent on the distribution generating the point-cloud.
Quite a different approach was taken in [3], where persistence diagrams were modelled as Gibbs measures, whose parameters can be estimated from the data. Once the model parameters are estimated, one can use MCMC algorithms to generate synthetic persistence diagrams. Generating random diagrams allows, for example, to compare different diagrams via hypothesis testing. This framework is quite generic and powerful and was further developed in [43]. However in this type of approach, the resulting distribution varies between different point-clouds, and the parameters need to be estimated for every model separately.
With respect to topological inference, the TDA literature provides various powerful methods, all based on the premise that the distribution of persistence diagram is inaccessible. In [26] the authors introduced confidence intervals for persistence diagrams, overcoming this issue by using sub-sampling methods and the stability of the persistence diagrams [22]. The confidence intervals are computed for an entire diagram, rather than for individual features, and thus are quite coarse. In addition, this approach requires restrictive assumptions on the underlying space (closed manifold), as well as the probability distributions and its derivatives. Other useful methods include distance to measures [17,18], witness complexes [24], persistence landscapes [13] and other vectorization techniques [2,14,38,46] , estimated null distributions [21], multicover bifiltrations [23,47], and other sub-sampling based methods [6,4,19,18,50,1].

Signal vs. Noise
In this section, we provide more details on the assumed decomposition of persistence diagrams into the signal and noise parts. Figure 1 presents a random sample from a figure eight or a wedge of two annuli in R 2 (left), with the corresponding H 1 persistence diagram (right). The signal part of this diagram consists of the two points far above the diagonal, corresponding to the two holes in the figure eight. All other points are much closer to the diagonal, and arise due to the randomness of the sampling, rather than the underlying structure of the space. We therefore consider them as the noise. Note that while the signal points will appear in a different locations if we re-sample the shape, the stability theorem for persistence diagrams [22] ensures that the signal points across different samples will be close to each other. More formally, we adopt the setting from manifold learning and assume the reader is familiar with persistent homology. Let S ⊆ R d be the support of a probability distribution, let X ⊆ S be a random subset, and let K r denote either theČech or Rips complex generated by X at distance r. Note that while the vertex set of K r is a subset of S, there is no immediate way to map H k (K r ) into H k (S). Our first step is thus to define such a mapping that will be natural, in the sense that it preserves the intuitive meaning of inclusion. To do so, we will define a space S r , along with maps i r * : H k (S) → H k (S r ), and j r * : H k (K r ) → H k (S r ). Mapping both H k (S) and H k (K r ) into a common space H k (S r ) we allow us find k-cycles in K r that have a corresponding cycle in S.
For theČech complex, let S r be the the r-offset of S, i.e., all points in R d which are at most distance r away from S. The map i r * is induced by the inclusion S → S r . The map j r * : H k (K r ) → H k (S r ) is induced by the inclusion X r → S r , where X r is the r-offset of X, combined with the Nerve Lemma [12]. For the Rips complex, we take S r to be the (infinite) Rips complex constructed over all points in S, at distance r. From [30], it is known that for small enough ϵ, S and S ϵ are homotopy equivalent. The map i r * is then induced by the inclusion S ϵ → S r , combined with the homotopy equivalence S ≃ S ϵ . The map j r * is induced by the inclusion K r → S r .
We can now define the signal as follows. For any point p ∈ dgm k , let b = birth(p), and let [p] ∈ H k (K b ) be a representative of the homology class of p at its birth time. Note that we do not assume that [p] is unique. We define the signal diagram as In other words, p will be considered as a signal point if it matches a non-trivial cycle in S (via i b * and j b * ). The noise diagram can be then defined as the remaining points dgm N k = dgm k \ dgm S k . Note that additional assumptions are needed if one wants to prove that the signal diagram captures any or all of the homology of S. This has been the topic of numerous studies, from a geometric perspective [16,20,49] and from a probabilistic perspective [11,26].
Remark 1. The definition we provide here for the signal diagram, assumes that the signal is represented by the support of the probability distribution. In other words, we assume that we are able to sample directly from S. This, however, ignores the plausible case, where the sample X is generated around S, rather than directly on it. In such cases, we may consider the signal as coming from a super-level set of the density function (i.e., a sufficiently dense region), rather than the support, which might even be the entire R d . Thus, the precise definition of signal and noise becomes much more complex, and requires additional assumptions and techniques. For example, in [9,39] it was shown that the signal can be recovered by estimating the density function, and then taking the super-level sets of the estimates. While this is an interesting direction of research within the field, we leave it for future work.

Complete experimental results
3.1. Sampling from iid distributions. We first examine the experimental results supporting Conjecture-1 more closely. To this end we examined the settings detailed in Tables 1 and 2. The π-values distributions for theČech filtration are presented in Figure 4 and for the Rips filtration in Figure 5. We note that not all settings in Tables 1 and 2 were computed for both theČech and the Rips complex. The reasons for that are purely computational. Some of the settings required more memory or computing power than we have at our disposal. Tables 3-4 provide the full list of iid-configurations tested, including the sample size used. The varying sample sizes are due to both our wish to explore the behavior at different values of n, and the computational limitation we have for computing the persistence diagrams. In Figures 6-9 Table 2. An exhaustive list of all iid non-uniform distributions tested. The homology dimensions computed (k) are the same as in Table 1. The second column contains the expression of the 1-dimensional density used (up to a normalizing constant).
For details on the iid distributions used, see the Methods section in the paper body, however we include examples of the less standard point clouds in Figure 2. An example for the Vietoris-Rips andČech complexes is presented in Figure 3.    Table 3 Figure 5. CDFs π-values in all iid settings tested, using the Rips complex. The complete list of models included in this figure is in Table 4 5 distribution (P) dimension (d) homology degree (k) sample size (n)    Figure 6. The ℓ-values distribution for theČech complex -part 1/3.

3.2.
Sampling from non-iid distributions. As mentioned in the paper, we tested two examples of non-iid point-clouds -a d-dimensional Brownian motion and a Lorenz dynamical system (see the Methods section for details on how these were generated), see Figure 10 for examples of these point clouds.
The results for Brownian motion can be seen in  In the top two rows of Figures 11-12, we present the distribution of π-values for Brownian point-clouds (in color) compared to a selected baseline from the iid distributions (in gray). We included these figures here to show that the Brownian samples behave very differently from the iid samples. Nevertheless, the bottom two row presents the distribution of the ℓ-values, which seems to follow the same LGumbel distribution as the iid case. All samples here are of size n = 50, 000. Similarly for the Lorenz dynamical system, the results are shown in Figure 13. The sample size for theČech is n = 100, 000, and for the Rips is n = 50, 000.    Figure 13. Sampling the Lorenz system -bothČech and Rips. 15 3.3. Testing on real data. As described in the paper, we also tested the strong-universality conjectures against three examples of real data -patches of natural images, sliding windows of voice recordings, and sentence embeddings.
The results for the Image Patches dataset are presented in Figure 14. In addition to taking the original 8-dimensional point-cloud, we also examined its lower dimensional projections for dimensions d = 3, . . . , 7. The sample size used was n = 50, 000 for all dimensions. Likewise the results for the Takens' embedding of a voice recording are presented in Figure 15. The results for the sentence embeddings is in Figure 16. The Bible has 29,812 sentences, while Moby Dick has 8,762 and Ulysses has 23,641. 3.4. Goodness of fit. As stated briefly in the paper, we wanted to perform a goodness-of-fit test, to provide further evidence for the validity of our conjectures. To this end, we used the two-sided Kolmogorov-Smirnov (KS) test, with the null hypothesis that the distribution of the ℓ-values is LGumbel. We tested all the 210 different settings presented in the previous sections, and for none of them the null hypothesis was rejected (with α = 0.05). Furthermore, it can be observed in Figure 17 that the KS statistic values are concentrated away from the critical value. This is a very positive reinforcement to our conjectures. A few comments are in place here: • The accurate conclusion from this KS test, is not that the ℓ-values have an LGumbel distribution, but rather that our experimental results do not contain sufficient evidence to the contrary. This is, however, the best case scenario we can hope for in such goodness-of-fit testing. • As we mentioned in the second remark following Conjecture 3, our results demonstrate some deviation of the tails which is a consequence of slow convergence rates. The deviation of the tail interferes with the performance of the KS test, especially for large sample sizes. Therefore, we ran the test on  a small (n = 1, 000) random sub-sample of the persistence values. This also ensures that we have almost-iid samples of the distribution (see Section 4), which is a requirement of the KS-test. We should note that the results presented are quite robust, in the sense that in repeated trials (with different sub-samples) we received similar results. • As we are running 210 tests simultaneously, this should be considered as a case of multiple-hypothesis testing. The p-values were therefore adjusted using the Bonferroni correction. Figure 17. Kolmogorov-Smirnov statistics. We present the histogram for the KS statistic values for the entire collection of 210 point-clouds, testing each point-cloud against the Gumbel distribution. We present three independent instances, where in each instance we generated a random sub-sample of n = 1000 ℓ-values to perform the KS-test on. Note that the histogram seems to be robust, and concentrated away from the critical value (for α = 0.05), represented by the vertical red line.

Non-universal examples.
Finally, we present three examples of point-clouds which do not follow the universal behavior described in the paper. They have properties which do not allow for points to be "too close" to each other or have a large number of significant cycles. We first describe the models, and the results can be seen in Figure 18 along with examples of the point clouds in Figure 19.
(1) Ginibre Ensemble: Our first example is the Ginibre ensemble [28]. This is a random point cloud of size n in two dimensions (in the complex plane), where the coordinates of the points are the eigenvalues of an n × n matrix with entries that are i.i.d. uniform random complex values. In the experiment below we chose n = 10, 000. (2) Perturbed Grid: Our second example is a perturbed grid in two dimensions. First taking points on a grid on the square, we perturb the points uniformly at random at 20% of the spacing. The example we show was on a 150 × 150 grid, so n = 22, 500. (3) Bubbles: Our last example is where the number and sizes of empty regions was maximized using gradient descent or flow on point [53], resulting in a "bubble" like structure on a two dimensional torus. We then added a small amount of noise to verify that the point cloud still did not exhibit universal behavior. In the result, n was 10, 000.

Dependency testing
In the paper, our first two conjectures, i.e. Conjectures 1 and 2, are concerned with the limit for the empirical measure of persistence π-values and ℓ-values, respectively. The way we interpret these conjectures is as follows. Given a random sample X n ∼ P n , the persistence diagram dgm k is a random point process in R 2 , i.e. a collection of random points {p 1 , . . . , p M } ⊂ R 2 , where M itself is random. Note that there is no natural ordering to the points in a diagram, and to some extent their ordering is algorithm-specific. Therefore, in our point process model we will assume that the points are ordered completely at random. In practice, given the output of a persistent homology algorithm dgm = {p 1 , . . . , p m }, where p i = (b i , d i ), we will consider the randomly ordered sequence (p i1 , . . . , p im ) where (i 1 , . . . , i m ) is a random permutation of (1, . . . , m). In the following, we will always assume that p 1 , . . . , p m are already randomly ordered. We denote their persistence values by π i = π(p i ) and ℓ i = ℓ(p i ). Given the size of the diagram M = m, one can show that the random variables π 1 , . . . , π m all have the same marginal distribution (the same goes for ℓ 1 , . . . , ℓ m ). This distribution is approximated by the empirical measures we studied earlier (i.e., Π n for π i , and L n for ℓ i ). However, even if we knew what the distribution of π i is exactly, it would not tell us anything about the joint m-dimensional distribution of (π 1 , . . . , π m ), and in particular about the dependency relationship between the different π-values (or ℓ-values) To study the dependency between cycles within a single diagram, we conducted the following experiment. We generated N independent point-clouds, and computed their persistence diagrams, From each diagram we randomly sampled 25 points, denoted p 1 , . . . , p 25 , and placed their corresponding ℓ-values in a vector of the form L = (ℓ 1 , . . . , ℓ 25 ) ∈ R 25 . We took N = 100, 000 such realizations, from which we estimated both the covariance and the distance-covariance [51] matrices. As these experiments are quite costly (computing 100, 000 persistence diagrams for an individual setting), we only examined two cases here -the 2-dimensional box, and the 2-dimensional Brownian sample, in order to compare an iid setting with a non-iid one.
The results are presented in Figure 20. In addition to the results for the box (right), and the Brownian motion (middle), we also included the covariance and distance-covariance matrices, computed from 25 independent LGumbel variables. In addition to the matrices look rather similar, we also include the mean and maximum absolute values form these matrices in Table 5. We note that not only are these values quite low, they are comparable to the values obtained in an empirical covariance matrix obtained by generating iid random variables from the LGumbel distribution.  Table 5. Comparing correlations between ℓ-values in a box, Brownian motion, and iid LGumbel variables. We compute the mean and maximum absolute values for both the correlation matrix and the distance-correlation of the 25 ℓ-values extracted from the persistence diagrams.

Exploring the value of B
Returning to strong universality (Conjecture-2), notice that the only part that depends on the distribution generating the point-cloud is the normalization value B = −λ−AL, where λ is the Euler-Mascheroni constant. SinceL is the average of all log log(π) values (a proxy for their mean value), it will depend on the distribution of the points, as we have seen in Section 3. It is therefore natural to ask about the relationship between B and the model parameters (S, T , k).
In Figure 21 we explore the estimated values of B, showing the means (with error bars) versus the dimension, for different iid settings. As suggested by Conjecture-1, the value of B depends on d, T , k, but is otherwise independent of S. Figure 21 also suggests a linear relationship between B and the sampling dimension (d). A thorough investigation of this relationship remains a future work.
We also include experiments on stratified samples, which are iid but of mixed dimensions (between 2 and 3). For the dimension value (the x-axis), we use the weighted average corresponding to the percentage of points sampled from each stratum. Hence if 20% of the samples come from the 3 dimensional manifold and 80% from the 2 dimensional manifold, we assign a dimension of 2.8 to the sample. We can further see that the values are interpolating between dimension 2 and 3, but this is not a linear relationship. This could be due to the simple estimate of the dimension, as the remaining relationships seem quite linear. The precise relationship is difficult to ascertain due to the limited range of dimensions and we defer it to future work.
In Figures 22 and 23, we show the distributions of our estimates of B, across the various models, using sample size of n = 50, 000. As can be seen in the figures, theČech complex estimates are much more concentrated than the Rips ones, and the distributions are more spread out in higher homological dimensions. We believe that this is due to slower convergence in these cases (in terms of the number of points).
Finally, we wish to explore the connection between estimation variance and the sample size n. In the case of Rips complex, the variance of the estimates of B seems to be quite small in the compact-uniform case (e.g. on a box), compared to the variance emerging when sampling, for example, from the non-compact distribution (normal and Cauchy). In Figure 24, we fix d = 3 and k = 1, and present the distribution for the box compared to the non-compact cases. The sample-size for the box was fixed n = 50, 000, and for the non-compact distribution we tested n = 10k, 20k, 50k, and 100k. We observe that as n increases the distribution of the non-compact cases moves closer to the box case. This serves as an evidence that in the limit all cases have indeed the same value of B (supporting Conjecture-1), it is just that some cases converge slower than others.

Hypothesis Testing Examples
In the paper we discussed our new hypothesis framework, and demonstrated it on the figure-8 example. In this section, we wish to present several more experiments testing this framework. In all our experiments, we set the desired significance level to be α = 0.05. Computing p-values. We start with a toy example to test our p-value computation. We sampled 1,000 points on an annulus in R 2 , whose outer radius is 1 and the inner radius is varied. In this case we expect to detect a single 1-cycle, corresponding to the hole of the annulus. For each value of r we computed the persistence diagram, and counted the number of cycles that are declared as significant (i.e., p-value < α | dgm | ). In Figure 25 on the right we present the results of this experiment with the curve showing the average number of significant cycles detected, over 1,000 repetitions, versus the inner radius of the annulus. In this figure we also show the persistence diagrams (left) of two realizations with different radii -one where the cycle is detected as significant, and one where it is not. For each diagram, the line of significance (corresponding to α = 0.05) is shown (the dotted and dashed lines correspond to the larger and smaller radii respectively). Note that due to different empirical means and different numbers of points in the persistence diagrams, these lines are not the same for the two diagrams, but are very close. The Figures in the middle show the cycles with the largest ℓ-value for R = 0.4 (left, significant) and R = 0.2 (right, non-significant).
In the next experiment, we used a "cut" annulus, i.e., we take an annulus with a strip removed from it (see Figure 26). The resulting space is contractible, and in particular has no signal to be detected. However, the point-cloud generated on the cut-annulus is very similar to that generated on a complete annulus. Indeed, the persistence diagram for points on the cut-annulus tend to have a single 1-cycle with a long lifetime (see Figure 26, left). Our goal is to show that, using our p-value computation, we can determine that despite its long lifetime, this cycle should not be detected as signal. In Figure 26 (middle left and middle right) we present the largest cycles for two different cut-widths. While these cycles look quite similar, the one on the right is not considered statistically significant, due to its later birth time. In Figure 26(right), we show the average number of signal cycles detected in the cut-annulus as a function of the width of the cut. When the width is very small (as in the middle-left figure), it is impossible to distinguish between the annulus and cut-annulus with just 1,000 points. Indeed, in these cases, we often we falsely detect a signal. However, already for relatively small values (around W = 0.05) there are cases where we manage to reject the "fake" signal, and at around W = 0.15 we do so perfectly. In other words, using our p-values we can reject cycles, even with relatively long lifetimes, that would otherwise be falsely detected as signal.  (middle-left) At cut-width W = 0.15 it is difficult to distinguish the cut-annulus from a complete one, and the cycle is detected as signal (p-value = 0.008). (middle-right) A non-significant but persistent cycle (p-value = 0.16), with a cut-width of W = 0.25. The lack of significance is due to the later birth time of the cycle. (right) The p-value curve for a cut-annulus with a function of the cut-width.

Infinite cycles
To demonstrate the power of Algorithm 1, we generated a point-cloud on the 2-dimensional torus, embedded in R 3 . The torus radii are R 1 = 1 and R 2 = 2. Originally, if we wanted to inspect the lifetime of the signal cycles in this example, and possibly to compute a p-value, we need to take the filtration all the way up to r = R 2 . This will result in a large amount of simplices that would require a massive amount of memory and CPU time. Instead, we ran Algorithm 1, which incrementally increases the filtration threshold just enough to be able to determine significance for all the cycles in the filtration. Table 6 provides some insight into how much computational resources we can save by using this method. For varying sample sizes, we ran this procedure and found the largest value of τ needed to classify all cycles as signal or noise. We compare the number of edges in the Rips complex at radius τ with the number of edges at radius 2 (which, without Algorithm 1, is the only way to compute a p-value for both signal cycles). We observe that the amount of edges we are saving is quite large, and is increasing with the sample size. The saving we expect to # of points  Table 6. The number of edges (in 1000s) at the full threshold (r = 2) versus the computed threshold τ versus the number of points. The last column represents the ratio of the number of edges.
achieve in the number of triangles is much higher, however, we were unable to compute the exact numbers due to computational limitations.